
    Multimodality, interactivity, and crowdsourcing for document transcription

    This is the peer-reviewed version of the following article: Granell, Emilio; Romero, Verónica; Martínez-Hinarejos, Carlos-D. (2018). Multimodality, interactivity, and crowdsourcing for document transcription. Computational Intelligence, 34(2), 398-419. DOI: 10.1111/coin.12169, which has been published in final form at http://doi.org/10.1111/coin.12169. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving.

    [EN] Knowledge mining from documents usually uses document engineering techniques that allow the user to access the information contained in documents of interest. In this framework, transcription may provide efficient access to the contents of handwritten documents. Manual transcription is a time-consuming task that can be sped up by using different mechanisms. A first possibility is employing state-of-the-art handwritten text recognition systems to obtain an initial draft transcription that can be manually amended. A second option is employing crowdsourcing to obtain a massive but not error-free draft transcription. In this case, when collaborators employ mobile devices, speech dictation can be used as a transcription source, and speech and handwritten text recognition can be fused to provide a better draft transcription, which can be amended with even less effort. A final option is using interactive assistive frameworks, where the automatic system that provides the draft transcription and the transcriber cooperate to generate the final transcription. The novel contributions presented in this work include the study of data fusion in a multimodal crowdsourcing framework and its integration with an interactive system. The use of the proposed solutions reduces the required transcription effort and optimizes the overall performance and usability, allowing for a better transcription process.

    Projects: READ, Grant/Award Number 674943 (European Union's H2020); Smart Ways, Grant/Award Number RTC-2014-1466-4 (MINECO); CoMUN-HaT, Grant/Award Number TIN2015-70924-C2-1-R (MINECO/FEDER).

    Granell, E.; Romero, V.; Martínez-Hinarejos, C. (2018). Multimodality, interactivity, and crowdsourcing for document transcription. Computational Intelligence. 34(2):398-419. https://doi.org/10.1111/coin.12169

    Handwriting recognition in historical documents using very large vocabularies

    © ACM 2013. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in HIP '13: Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing, http://dx.doi.org/10.1145/2501115.2501116.

    Language models are used in automatic transcription systems to resolve ambiguities. This is done by limiting the vocabulary of words that can be recognized as well as by estimating the n-gram probabilities of the words in the given text. In the context of historical documents, non-unified spelling and the limited amount of written text pose a substantial problem for the selection of the recognizable vocabulary as well as for the computation of the word probabilities. In this paper we propose, for the transcription of historical Spanish text, to keep the corpus for the n-gram limited to a sample of the target text, but to expand the vocabulary with words gathered from external resources. We analyze the performance of such a transcription system with different sizes of external vocabularies and demonstrate the applicability, and the significant increase in recognition accuracy, of using up to 300 thousand external words.

    This work has been supported by the European project FP7-PEOPLE-2008-IAPP: 230653, the European Research Council's Advanced Grant ERC-2010-AdG 20100407, the Spanish R&D projects TIN2009-14633-C03-03, RYC-2009-05031, TIN2011-24631, TIN2012-37475-C02-02, MITTRAL (TIN2009-14633-C03-01), and Active2Trans (TIN2012-31723), as well as the Swiss National Science Foundation fellowship project PBBEP2_141453.

    Frinken, V.; Fischer, A.; Martínez-Hinarejos, C. (2013). Handwriting recognition in historical documents using very large vocabularies. ACM. https://doi.org/10.1145/2501115.2501116
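    The trade-off described in this abstract, keeping the n-gram corpus small while expanding the vocabulary externally, hinges on the out-of-vocabulary (OOV) rate. A minimal sketch of that effect, using toy data rather than anything from the paper:

```python
# Sketch: vocabulary expansion for a closed-vocabulary recognizer.
# All words and vocabularies below are illustrative, not from the paper.

def oov_rate(text_words, vocabulary):
    """Fraction of running words not covered by the vocabulary."""
    oov = sum(1 for w in text_words if w not in vocabulary)
    return oov / len(text_words)

# In-domain sample: small, so its vocabulary covers the target text poorly.
sample = "en el nombre de dios padre".split()
target = "en el nombre de dios e de la virgen".split()

in_domain_vocab = set(sample)
# Expansion with words gathered from (hypothetical) external resources.
external_vocab = in_domain_vocab | {"e", "la", "virgen", "sancta", "maria"}

print(oov_rate(target, in_domain_vocab))  # high OOV rate on the toy target
print(oov_rate(target, external_vocab))   # lower after expansion
```

    A closed-vocabulary recognizer can never emit an OOV word, so lowering this rate raises the ceiling on recognition accuracy; the n-gram probabilities themselves are still estimated only on the in-domain sample.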

    Image speech combination for interactive computer assisted transcription of handwritten documents

    [EN] Handwritten document transcription aims to obtain the contents of a document in order to provide efficient information access to, among others, digitised historical documents. The increasing number of historical documents published by libraries and archives makes this an important task. In this context, the use of image processing and understanding techniques in conjunction with assistive technologies reduces the time and human effort required to obtain the final perfect transcription. The assistive transcription system proposes a hypothesis, usually derived from a recognition process of the handwritten text image. Then, the professional transcriber's feedback can be used to obtain an improved hypothesis and speed up the final transcription. In this framework, a speech signal corresponding to the dictation of the handwritten text can be used as an additional source of information. This multimodal approach, which combines the image of the handwritten text with the speech of the dictation of its contents, could improve the hypotheses (initial and improved) offered to the transcriber. In this paper we study the feasibility of a multimodal interactive transcription system for an assistive paradigm known as Computer Assisted Transcription of Text Images. Different techniques are tested for obtaining the multimodal combination in this framework. The use of the proposed multimodal approach reveals a significant reduction of transcription effort with some multimodal combination techniques, allowing for a faster transcription process.

    Work partially supported by projects READ-674943 (European Union's H2020), SmartWays-RTC-2014-1466-4 (MINECO, Spain), and CoMUN-HaT-TIN2015-70924-C2-1-R (MINECO/FEDER), and by Generalitat Valenciana (GVA), Spain, under reference PROMETEOII/2014/030.

    Granell, E.; Romero, V.; Martínez-Hinarejos, C. (2019). Image speech combination for interactive computer assisted transcription of handwritten documents. Computer Vision and Image Understanding. 180:74-83. https://doi.org/10.1016/j.cviu.2019.01.009
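    One common family of multimodal combination techniques is word-level voting over aligned recognizer outputs (in the spirit of ROVER). The sketch below assumes the image and speech hypotheses are already aligned position by position; real systems align them first with dynamic programming. The words and confidence scores are illustrative, not from the paper:

```python
# Sketch of word-level hypothesis combination by confidence voting.
# Assumes both hypotheses are already aligned position by position.

def combine(htr, asr):
    """Pick, at each aligned position, the (word, score) pair with higher score."""
    return [max(pair, key=lambda ws: ws[1])[0] for pair in zip(htr, asr)]

# (word, confidence) pairs from each modality for the same sentence.
htr_hyp = [("quixote", 0.9), ("de", 0.8), ("lo", 0.4), ("mancha", 0.7)]
asr_hyp = [("quijote", 0.6), ("de", 0.9), ("la", 0.8), ("mancha", 0.9)]

print(combine(htr_hyp, asr_hyp))  # ['quixote', 'de', 'la', 'mancha']
```

    The intuition is that the two modalities fail in different places (image noise vs. acoustic confusions), so the combined hypothesis can be better than either input and thus needs fewer transcriber corrections.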

    Review of Research on Speech Technology: Main Contributions From Spanish Research Groups

    In the last two decades, there has been an important increase in research on speech technology in Spain, mainly due to a higher level of funding from European, Spanish and local institutions, and also due to a growing interest in these technologies for developing new services and applications. This paper provides a review of the main areas of speech technology addressed by research groups in Spain, their main contributions in recent years, and their main focus of interest these days. This description is classified into five main areas: audio processing including speech; speaker characterization; speech and language processing; text-to-speech conversion; and spoken language applications. This paper also introduces the Spanish Network of Speech Technologies (RTTH, Red Temática en Tecnologías del Habla) as the research network that includes almost all the researchers working in this area, presenting some figures, its objectives and its main activities developed in the last years.

    Improving the automatic segmentation of subtitles through conditional random field

    [EN] Automatic segmentation of subtitles is a novel research field which has not been studied extensively to date. However, quality automatic subtitling is a real need for broadcasters, which seek automatic solutions given the demanding European audiovisual legislation. In this article, a method based on Conditional Random Fields is presented to deal with automatic subtitle segmentation. This is a continuation of previous work in the field, which proposed a method based on a Support Vector Machine classifier to generate possible candidates for breaks. For this study, two corpora in Basque and Spanish were used for experiments, and the performance of the current method was tested and compared with the previous solution and two rule-based systems through several evaluation metrics. Finally, an experiment with human evaluators was carried out with the aim of measuring the productivity gain in post-editing automatic subtitles generated with the new method presented.

    This work was partially supported by the project CoMUN-HaT - TIN2015-70924-C2-1-R (MINECO/FEDER).

    Alvarez, A.; Martínez-Hinarejos, C.; Arzelus, H.; Balenciaga, M.; Del Pozo, A. (2017). Improving the automatic segmentation of subtitles through conditional random field. Speech Communication. 88:83-95. https://doi.org/10.1016/j.specom.2017.01.010
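    A CRF casts subtitle segmentation as sequence labelling: each inter-word boundary becomes a position with a feature dictionary and a BREAK / NO_BREAK label. Training a real CRF needs a dedicated library; the sketch below shows only the framing, with a trivial punctuation rule standing in for the learned model. The feature names and the rule are illustrative, not the paper's actual feature set:

```python
# Sketch: subtitle segmentation as sequence labelling over word boundaries.
# A trained CRF would score label sequences from these features; here a
# simple punctuation rule stands in for the model (illustrative only).

def boundary_features(words, i):
    """Feature dictionary for the boundary after words[i]."""
    return {
        "word": words[i].lower(),
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        "ends_sentence": words[i].endswith((".", "?", "!")),
        "ends_clause": words[i].endswith(","),
    }

def label_boundaries(words):
    """Stand-in for the CRF: label each inter-word boundary."""
    labels = []
    for i in range(len(words) - 1):
        f = boundary_features(words, i)
        labels.append("BREAK" if f["ends_sentence"] or f["ends_clause"]
                      else "NO_BREAK")
    return labels

print(label_boundaries("Good evening, here is the news.".split()))
```

    Unlike a per-boundary classifier (such as the SVM of the previous work), a CRF scores the whole label sequence jointly, which lets it penalize, for example, two breaks placed implausibly close together.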

    Estimating the number of segments for improving dialogue act labelling

    In dialogue systems it is important to label the dialogue turns with dialogue-related meaning. Each turn is usually divided into segments, and these segments are labelled with dialogue acts (DAs). A DA is a representation of the functional role of the segment. Each segment is labelled with one DA, representing its role in the ongoing discourse. The sequence of DAs of a dialogue turn is used by the dialogue manager to understand the turn. Probabilistic models that perform DA labelling can be used on segmented or unsegmented turns. The latter option is more likely in a practical dialogue system, but it provides poorer results. In that case, a hypothesis for the number of segments can be provided to improve the results. We propose some methods to estimate the probability of the number of segments based on the transcription of the turn. The new labelling model includes the estimation of the probability of the number of segments in the turn. We tested this new approach with two different dialogue corpora: SwitchBoard and Dihana. The results show that this inclusion significantly improves the labelling accuracy. © Copyright Cambridge University Press 2011.

    Work supported by the EC (FEDER/FSE) and the Spanish Government (MEC, MICINN, MITyC, MAEC, "Plan E", under grants MIPRCV "Consolider Ingenio 2010" CSD2007-00018, MITTRAL TIN2009-14633-C03-01, erudito.com TSI-020110-2009-439, and FPI fellowship BES-2007-16834), and by Generalitat Valenciana (grants Prometeo/2009/014 and ACOMP/2010/051).

    Tamarit Ballester, V.; Martínez-Hinarejos, C.; Benedí Ruiz, J.M. (2012). Estimating the number of segments for improving dialogue act labelling. Natural Language Engineering. 18(1):1-19. doi:10.1017/S135132491000032X
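    The abstract's key idea is estimating, from the turn transcription alone, a probability distribution over the number of segments, which the labelling model then exploits. A minimal sketch of one such estimator, reduced to a single cue (turn length) and trained on toy counts that are illustrative, not from SwitchBoard or Dihana:

```python
# Sketch: relative-frequency estimate of P(number of segments | turn),
# here conditioned only on a coarse turn-length bucket (illustrative).
from collections import Counter, defaultdict

# Toy training data: (number_of_words_in_turn, number_of_segments) pairs.
train = [(3, 1), (4, 1), (5, 1), (8, 2), (9, 2), (10, 2), (15, 3)]

def bucket(n_words):
    """Coarse length bucket for conditioning."""
    return min(n_words // 5, 3)

counts = defaultdict(Counter)
for n_words, n_segs in train:
    counts[bucket(n_words)][n_segs] += 1

def p_segments(n_words, n_segs):
    """Estimate of P(n_segs | length bucket of the turn)."""
    c = counts[bucket(n_words)]
    return c[n_segs] / sum(c.values()) if c else 0.0

print(p_segments(4, 1))  # short turns: a single segment is most likely
print(p_segments(9, 2))
```

    In the paper's framework this distribution is combined with the DA labelling model, so that label sequences whose implied segmentation has an unlikely number of segments are penalized during decoding.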

    Impact of automatic segmentation on the quality, productivity and self-reported post-editing effort of intralingual subtitles

    This paper describes the evaluation methodology followed to measure the impact of using a machine learning algorithm to automatically segment intralingual subtitles. The segmentation quality, productivity and self-reported post-editing effort achieved with this approach are shown to improve on those obtained by the technique based on counting characters, currently the one mainly employed for automatic subtitle segmentation. The corpus used to train and test the proposed automated segmentation method is also described and shared with the community, in order to foster further research in this area.

    Towards Children-Centred Trustworthy Conversational Agents

    Conversational agents (CAs) have been increasingly used in various domains, including education, health and entertainment. One of the growing areas of research is the use of CAs with children. However, the development and deployment of CAs for children come with many specific challenges and ethical and social responsibility concerns. This chapter aims to review the related work on CAs and children, point out the most popular topics, and identify opportunities and risks. We also present our proposal for ethical guidelines on the development of trustworthy artificial intelligence (AI), which provide a framework for the ethical design and deployment of CAs with children. The chapter highlights, among other principles, the importance of transparency and inclusivity to safeguard user rights in AI technologies. Additionally, we present the adaptation of previous AI ethical guidelines to the specific case of CAs and children, highlighting the importance of data protection and human agency. Finally, the application of ethical guidelines to the design of a conversational agent is presented, serving as an example of how these guidelines can be integrated into the development process of these systems. Ethical principles should guide the research and development of CAs for children to enhance their learning and social development.

    The Percepción Smart Campus system

    Paper presented at IberSPEECH 2014, VIII Jornadas en Tecnología del Habla and IV Iberian SLTech Workshop, held in Las Palmas de Gran Canaria, 19-21 November 2014.

    This paper presents the capabilities of the Smart Campus system developed during the Percepción project. The Smart Campus system is able to locate the user of the application in a limited environment, including indoor location. The system is able to show routes and data (using virtual reality) on the different elements of the environment. Speech queries can be used to locate places and to get routes and information on those places.

    Transcription of Spanish Historical Handwritten Documents with Deep Neural Networks

    [EN] The digitization of historical handwritten document images is important for the preservation of cultural heritage. Moreover, the transcription of text images obtained from digitization is necessary to provide efficient information access to the content of these documents. Handwritten Text Recognition (HTR) has become an important research topic in the areas of image and computational language processing that allows us to obtain transcriptions from text images. State-of-the-art HTR systems are, however, far from perfect. One difficulty is that they have to cope with image noise and handwriting variability. Another difficulty is the presence of a large amount of Out-Of-Vocabulary (OOV) words in ancient historical texts. A solution to this problem is to use external lexical resources, but such resources might be scarce or unavailable given the nature and the age of such documents. This work proposes a solution to avoid this limitation. It consists of associating a powerful optical recognition system, which will cope with image noise and variability, with a language model based on sub-lexical units, which will model OOV words. Such a language modeling approach reduces the size of the lexicon while increasing the lexicon coverage. Experiments are first conducted on the publicly available Rodrigo dataset, which contains the digitization of an ancient Spanish manuscript, with a recognizer based on Hidden Markov Models (HMMs). They show that sub-lexical units outperform word units in terms of Word Error Rate (WER), Character Error Rate (CER) and OOV word accuracy rate. This approach is then applied to deep net classifiers, namely Bi-directional Long-Short Term Memory nets (BLSTMs) and Convolutional Recurrent Neural Nets (CRNNs). Results show that CRNNs outperform HMMs and BLSTMs, reaching the lowest WER and CER for this image dataset and significantly improving OOV recognition.

    Work partially supported by projects READ: Recognition and Enrichment of Archival Documents - 674943 (European Union's H2020) and CoMUN-HaT: Context, Multimodality and User Collaboration in Handwritten Text Processing - TIN2015-70924-C2-1-R (MINECO/FEDER), and by a DGA-MRIS (Direction Générale de l'Armement - Mission pour la Recherche et l'Innovation Scientifique) scholarship.

    Granell, E.; Chammas, E.; Likforman-Sulem, L.; Martínez-Hinarejos, C.; Mokbel, C.; Cirstea, B. (2018). Transcription of Spanish Historical Handwritten Documents with Deep Neural Networks. Journal of Imaging. 4(1). https://doi.org/10.3390/jimaging4010015
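    The claim that sub-lexical units reduce the lexicon size while increasing coverage has a simple intuition: a word-level lexicon can never emit an unseen word, whereas sub-lexical units (in the extreme, single characters) can compose it. A toy illustration in English, not data from the Rodrigo corpus:

```python
# Sketch: word-level vs. sub-lexical (character) lexicon coverage.
# Toy training and test texts, illustrative only.

def coverage(units, lexicon):
    """Fraction of running units in the test text covered by the lexicon."""
    return sum(1 for u in units if u in lexicon) / len(units)

train_words = "in the name of our god".split()
test_words = "in the name of the mother of god and man".split()

word_lexicon = set(train_words)
char_lexicon = set("".join(train_words))  # sub-lexical: single characters

word_cov = coverage(test_words, word_lexicon)
char_cov = coverage([c for w in test_words for c in w], char_lexicon)

print(word_cov)  # OOV words 'mother', 'and', 'man' are unreachable
print(char_cov)  # their characters were all seen, so they can be composed
```

    Real systems use larger sub-lexical units than single characters (for example character n-grams or morph-like units) to keep more linguistic context in the language model while retaining this open-vocabulary property.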